Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?
Multi-objective bandits, in which the reward of each arm is a multi-dimensional vector rather than a scalar, have attracted increasing attention because of their broad applicability and mathematical elegance. The vector rewards naturally introduce Pareto order relations and the notion of Pareto regret. A long-standing question in this area is whether this added complexity makes the problem fundamentally harder to optimize. A recent surprising result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; in the stochastic setting, however, where the regret notion differs, the picture remains unclear. In fact, existing work suggests that Pareto regret in the stochastic case increases with the reward dimensionality. This puzzling yet subtle phenomenon motivates our central question: \emph{are multi-objective bandits actually harder than single-objective ones?} We answer this question in full by showing that, in the stochastic setting, Pareto regret is governed by the maximum sub-optimality gap \(g^\dagger\): it is lower bounded by the minimum marginal regret, which is of order \(\Omega(\frac{K\log T}{g^\dagger})\). We further develop a new algorithm that achieves Pareto regret of order \(O(\frac{K\log T}{g^\dagger})\) and is therefore optimal. The algorithm leverages a nested, two-layer uncertainty quantification over both arms and objectives through upper and lower confidence bound estimators: it combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension (objective) selection. Together, these components balance exploration and exploitation across the two layers. We also conduct comprehensive numerical experiments that validate the proposed algorithm, showing the desired regret guarantee and significant gains over benchmark methods.
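To make the two-layer scheme concrete, here is a minimal Python sketch of a confidence-bound strategy in the spirit the abstract describes. It assumes the learner picks one arm and one objective per round and observes a single scalar; the particular leader/challenger ranking and the overlap-based dimension rule are illustrative choices of ours, not the paper's algorithm.

```python
import numpy as np

# Minimal sketch (our own simplification): K arms, D objectives, and each round the
# learner picks one arm AND one objective to observe, receiving a single scalar.
def two_layer_ucb_sketch(pull, K, D, T, delta=0.01):
    means = np.zeros((K, D))   # empirical means per (arm, objective)
    counts = np.ones((K, D))   # pull counts per (arm, objective), after init
    for a in range(K):         # initialize every (arm, objective) pair once
        for d in range(D):
            means[a, d] = pull(a, d)
    for _ in range(K * D, T):
        rad = np.sqrt(2.0 * np.log(1.0 / delta) / counts)   # confidence radii
        ucb, lcb = means + rad, means - rad
        # Arm layer ("top-two racing", assumed rule): leader and challenger ranked
        # by each arm's best optimistic objective value.
        order = np.argsort(-ucb.max(axis=1))
        leader, challenger = int(order[0]), int(order[1])
        # Objective layer ("uncertainty-greedy", assumed rule): the dimension whose
        # confidence intervals for the two racers overlap the most.
        overlap = np.minimum(ucb[leader], ucb[challenger]) - np.maximum(lcb[leader], lcb[challenger])
        d = int(np.argmax(overlap))
        # Pull whichever racer is more uncertain on that objective.
        a = leader if rad[leader, d] >= rad[challenger, d] else challenger
        r = pull(a, d)
        counts[a, d] += 1
        means[a, d] += (r - means[a, d]) / counts[a, d]
    return means, counts
```

The arm layer races the two most optimistic arms, while the objective layer spends observations on the dimension where the race is least resolved.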
Thompson Sampling for Multi-Objective Linear Contextual Bandit
Park, Somangchan, Ann, Heesang, Oh, Min-hwan
We study the multi-objective linear contextual bandit problem, where multiple, possibly conflicting objectives must be optimized simultaneously. We propose \texttt{MOL-TS}, the \textit{first} Thompson Sampling algorithm with Pareto regret guarantees for this problem. Unlike standard approaches that compute an empirical Pareto front each round, \texttt{MOL-TS} samples parameters across objectives and efficiently selects an arm from a novel \emph{effective Pareto front}, which accounts for repeated selections over time. Our analysis shows that \texttt{MOL-TS} achieves a worst-case Pareto regret bound of $\widetilde{O}(d^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature vectors and $T$ is the total number of rounds, matching the best known order for randomized single-objective linear bandit algorithms. Empirical results confirm the benefits of our proposed approach, demonstrating improved regret minimization and strong multi-objective performance.
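As a rough illustration of the sampling step, the Python sketch below runs one round of plain multi-objective linear Thompson Sampling: it draws one posterior sample per objective, forms sampled reward vectors for the arms, and picks uniformly from the resulting Pareto front. The paper's \emph{effective} Pareto front, its posterior updates, and its regret analysis are not reproduced here; all names and shapes are placeholders.

```python
import numpy as np

def pareto_front(values):
    """Indices of arms not dominated coordinate-wise by any other arm."""
    n = len(values)
    return [i for i in range(n)
            if not any(np.all(values[j] >= values[i]) and np.any(values[j] > values[i])
                       for j in range(n))]

def mol_ts_round(features, V, b, noise_scale=1.0, rng=np.random):
    """One round of a plain multi-objective linear TS sketch.

    features: (K, d) arm feature matrix for this round.
    V: (d, d) regularized design matrix; b: (m, d) per-objective response sums.
    """
    V_inv = np.linalg.inv(V)
    theta_hat = b @ V_inv                      # (m, d) regularized least-squares estimates
    sampled = np.array([rng.multivariate_normal(th, noise_scale**2 * V_inv)
                        for th in theta_hat])  # one posterior sample per objective
    values = features @ sampled.T              # (K, m) sampled reward vectors
    front = pareto_front(values)
    return rng.choice(front), values           # pick uniformly from the sampled front
```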
Learning to Lead: Incentivizing Strategic Agents in the Dark
Wu, Yuchen, Zhong, Xinyi, Yang, Zhuoran
The principal-agent model (Ross, 1973; Grossman and Hart, 1992; Smith, 2004; Laffont and Martimort, 2009) is a fundamental framework for understanding decision-making processes with misaligned incentives and information asymmetry, with wide applications across various disciplines such as economics, finance, and computer science (Ratliff et al., 2018; Kamenica, 2012). In this model, the principal represents an entity such as a service provider, a policy maker, or a firm, whose objective is to maximize certain system-level outcomes, such as revenue, social welfare, or efficiency. On the other hand, an agent, who could be a customer, an employee, or an individual participant, aims to optimize his utility based on his private preferences or information, which is not directly observable by the principal. To induce the optimal outcomes, the principal designs and commits to a mechanism, which could be a contract, an incentive scheme, or a policy, that aligns the agent's incentives with the principal's objectives. The optimal mechanism and the agent's optimal strategy against it constitute the equilibrium of the principal-agent model, in certain settings also known as the Stackelberg equilibrium (Stackelberg, 1934, 2010).
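In symbols, and keeping only the structure sketched above (generic notation, not the paper's), the principal's commitment and the agent's best response form a bilevel problem whose solution is the Stackelberg equilibrium:
\[
a^*(m) \in \operatorname*{arg\,max}_{a \in \mathcal{A}} U_{\mathrm{A}}(m, a),
\qquad
m^* \in \operatorname*{arg\,max}_{m \in \mathcal{M}} U_{\mathrm{P}}\bigl(m, a^*(m)\bigr),
\]
where \(\mathcal{M}\) is the set of mechanisms the principal can commit to, \(\mathcal{A}\) is the agent's action set, and \(U_{\mathrm{P}}, U_{\mathrm{A}}\) are the two utilities; the pair \((m^*, a^*(m^*))\) is the equilibrium referred to above.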
Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF
Reinforcement Learning with Human Feedback (RLHF) is a widely used fine-tuning approach that aligns machine learning models, particularly language models (LMs), with human preferences. Preferences are typically driven by multiple objectives, so humans find it easier to express per-objective comparisons than a single global preference between two choices. Multi-Objective RLHF (MORLHF) uses per-objective preference feedback and aims to achieve Pareto optimality among these objectives by aggregating them into a single unified objective for optimization. However, nearly all prior works rely on linear aggregation, which rules out policies that favor specific objectives, such as the worst-performing one. The only existing approach using non-linear aggregation is computationally expensive due to its reward-based nature and the need for retraining whenever the aggregation parameters change. In this work, we address this limitation by transforming the non-linear aggregation maximization problem into a series of sub-problems. Each sub-problem involves only linear aggregation, making it computationally efficient to solve. We further extend our framework to handle multi-group scenarios, where each group has distinct weights for the objectives. Our method enables achieving consensus or maximizing the aggregated objective across all groups. Theoretically, we demonstrate that our algorithmic framework achieves sublinear regret and can be easily adapted to a reward-free algorithm. Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained.
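One standard way to realize such a reduction (a Frank-Wolfe-style outer loop; this is our illustrative reading of the idea, not the paper's exact scheme) is to linearize a smooth concave aggregation at the current value vector and hand the resulting weights to a linear-aggregation solver:

```python
import numpy as np

def nonlinear_via_linear_subproblems(solve_linear, agg_grad, m, iters=50):
    """Sketch: maximize a smooth concave aggregation g(v) of per-objective values
    by repeatedly solving LINEAR-aggregation sub-problems.

    solve_linear(w) -> value vector v in R^m of the policy maximizing <w, v>.
    agg_grad(v)     -> gradient of the aggregation g at v.
    """
    v = solve_linear(np.ones(m) / m)           # warm start with uniform weights
    for t in range(1, iters + 1):
        w = agg_grad(v)                        # linearize g at the current values
        v_new = solve_linear(w)                # linear-aggregation sub-problem
        step = 2.0 / (t + 2.0)                 # standard Frank-Wolfe step size
        v = (1 - step) * v + step * v_new      # mix achievable value vectors
    return v

# Example aggregation: soft-min across objectives (favors the worst objective),
# which no single fixed linear weighting can express.
def softmin_grad(v, tau=0.1):
    z = np.exp(-v / tau)
    return z / z.sum()
```

Mixing value vectors in the update step corresponds to mixing the underlying policies, so every iterate remains an achievable value vector.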
On the Optimality of Dilated Entropy and Lower Bounds for Online Learning in Extensive-Form Games
Fan, Zhiyuan, Kroer, Christian, Farina, Gabriele
First-order methods (FOMs) are arguably the most scalable algorithms for equilibrium computation in large extensive-form games. To operationalize these methods, a distance-generating function, acting as a regularizer for the strategy space, must be chosen. The ratio between the strong convexity modulus and the diameter of the regularizer is a key parameter in the analysis of FOMs. A natural question is then: what is the optimal distance-generating function for extensive-form decision spaces? In this paper, we make a number of contributions, ultimately establishing that the weight-one dilated entropy (DilEnt) distance-generating function is optimal up to logarithmic factors. The DilEnt regularizer is notable due to its iterate-equivalence with Kernelized OMWU (KOMWU) -- the algorithm with state-of-the-art dependence on the game tree size in extensive-form games -- when used in conjunction with the online mirror descent (OMD) algorithm. However, the standard analysis for OMD is unable to establish such a result; the only current analysis is by appealing to the iterate equivalence to KOMWU. We close this gap by introducing a pair of primal-dual treeplex norms, which we contend form the natural analytic viewpoint for studying the strong convexity of DilEnt. Using these norm pairs, we recover the diameter-to-strong-convexity ratio that predicts the same performance as KOMWU. Along with a new regret lower bound for online learning in sequence-form strategy spaces, we show that this ratio is nearly optimal. Finally, we showcase our analytic techniques by refining the analysis of Clairvoyant OMD when paired with DilEnt, establishing an $\mathcal{O}(n \log |\mathcal{V}| \log T/T)$ approximation rate to coarse correlated equilibrium in $n$-player games, where $|\mathcal{V}|$ is the number of reduced normal-form strategies of the players, thereby setting a new state of the art.
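For reference, the weight-one dilated entropy itself is simple to write down: it is the sum, over information sets, of a local negative entropy "dilated" by the probability of reaching that information set. The Python sketch below uses our own sequence-form indexing conventions and is only meant to pin down the object being analyzed.

```python
import numpy as np

def weight_one_dilated_entropy(x, infosets):
    """Sketch of the weight-one dilated entropy of a sequence-form strategy.

    x: dict mapping sequence ids to realization-plan probabilities
       (the empty/root sequence has value 1.0).
    infosets: list of (parent_sequence, [child_sequences]) pairs, one per
       information set of the decision maker.
    Weight-one means every information set's local term enters with coefficient 1.
    """
    val = 0.0
    for parent, children in infosets:
        p = x[parent]
        if p <= 0:
            continue
        for c in children:
            if x[c] > 0:
                # local entropy term, dilated by the parent sequence probability:
                # x[c] * log(x[c] / p) = p * (x[c]/p) * log(x[c]/p)
                val += x[c] * np.log(x[c] / p)
    return val
```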
SEAL: Systematic Error Analysis for Value ALignment
Revel, Manon, Cargnelutti, Matteo, Eloundou, Tyna, Leppert, Greg
Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values, namely feature imprint, alignment resistance and alignment robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them - a metric we term feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to perturbed inputs. Our experiments, utilizing open-source components like the Anthropic/hh-rlhf preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observed a 26% incidence of alignment resistance in portions of the dataset where LM-labelers disagreed with human preferences. Furthermore, we find that misalignment often arises from ambiguous entries within the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.
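A minimal sketch of the two headline metrics, assuming binary feature annotations and paired chosen/rejected RM scores (the plain least-squares setup and the variable names are our simplification, not the paper's full pipeline):

```python
import numpy as np

def feature_imprints(scores, features):
    """Regress reward-model scores on 0/1 feature indicators.

    scores:   (n,) RM scores for n completions.
    features: (n, f) binary matrix of target/spoiler feature annotations.
    Returns per-feature regression coefficients ("feature imprints").
    """
    X = np.hstack([np.ones((len(scores), 1)), features])  # add intercept column
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return coef[1:]

def alignment_resistance(rm_chosen, rm_rejected):
    """Fraction of preference pairs where the RM scores the human-rejected
    completion at least as high as the human-chosen one."""
    return float(np.mean(np.asarray(rm_chosen) <= np.asarray(rm_rejected)))
```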
Explaining an Agent's Future Beliefs through Temporally Decomposing Future Reward Estimators
Towers, Mark, Du, Yali, Freeman, Christopher, Norman, Timothy J.
Future reward estimation is a core component of reinforcement learning agents: Q-value and state-value functions predict an agent's expected sum of future rewards. Their scalar output, however, obscures when, and which, individual future rewards an agent may expect to receive. We address this by modifying an agent's future reward estimator to predict its next N expected rewards, referred to as Temporal Reward Decomposition (TRD). This unlocks novel explanations of agent behaviour. Through TRD we can: estimate when an agent may expect to receive a reward, the value of that reward, and the agent's confidence in receiving it; measure an input feature's temporal importance to the agent's action decisions; and predict the influence of different actions on future rewards. Furthermore, we show that DQN agents trained on Atari environments can be efficiently retrained to incorporate TRD with minimal impact on performance.
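A minimal PyTorch-style sketch of the idea: replace the scalar Q head with one that outputs the next N expected rewards per action, so that a discounted sum recovers a (truncated) Q-value. The layer sizes, the `features` input, and the truncation are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TRDQHead(nn.Module):
    """Temporal-Reward-Decomposition-style Q head: predicts the next N expected
    rewards for every action instead of a single scalar per action."""

    def __init__(self, feat_dim, n_actions, n_steps, gamma=0.99):
        super().__init__()
        self.n_actions, self.n_steps, self.gamma = n_actions, n_steps, gamma
        self.out = nn.Linear(feat_dim, n_actions * n_steps)

    def forward(self, features):
        # (batch, n_actions, n_steps): expected reward i steps ahead per action
        return self.out(features).view(-1, self.n_actions, self.n_steps)

    def q_values(self, features):
        # Collapse the decomposition back into a (truncated) Q-value per action.
        rewards = self.forward(features)
        discounts = self.gamma ** torch.arange(self.n_steps, dtype=rewards.dtype)
        return (rewards * discounts).sum(dim=-1)
```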